# Full text of "A priori truncation method for posterior sampling from homogeneous normalized completely random measure mixture models"

## See other formats

arXiv:1507.04528vl [math.ST] 16Jul2015 A priori truncation method for posterior sampling from homogeneous normalized completely random measure mixture models Raffaele Argiento* Ilaria Bianchini' Alessandra Guglielmi ' Abstract This paper adopts a Bayesian nonparametric mixture model where the mixing distri¬ bution belongs to the wide class of normalized homogeneous completely random mea¬ sures. We propose a truncation method for the mixing distribution by discarding the weights of the unnormalized measure smaller than a threshold. We prove convergence in law of our approximation, provide some theoretical properties and characterize its posterior distribution so that a blocked Gibbs sampler is devised. The versatility of the approximation is illustrated by two different applications. In the first the normalized Bessel random measure, encompassing the Dirichlet process, is introduced; goodness of fit indexes show its good performances as mixing measure for density estimation. The second describes how to incorporate covariates in the sup¬ port of the normalized measure, leading to a linear dependent model for regression and clustering. Keywords: Bayesian nonparametric mixture models • normalized completely random measure • blocked Gibbs sampler • finite dimensional approximation • a priori trun¬ cation method 1 Introduction One of the livelier topic in Bayesian Nonparametrics concerns mixtures of parametric den¬ sities where the mixing measure is an almost surely discrete random probability measure. The basic model is what is known now as Dirichlet process mixture model, appeared first in Lo (1984), where the mixing measure is indeed the Dirichlet process. Dating back to Ishwaran and James (2001) and Lijoi et al. (2005), many alternative mixing measures have been proposed; the former paper replaced the Dirichlet process with stick-breaking random probability measures, while the latter focused on normalized completely random measures. *CNR-IMATI El raffaele@mi.imati.cnr.it tPolitecnico di Milano E {ilaria.bianchinijjalessandra.guglielmiJ@polimi.it 1 e-NormCRM mixtures 2 These hierarchical mixtures play a pivotal role in modern Bayesian Nonparametrics, since their potentialities range within many applications. Indeed, they can easily be ex¬ ploited in very different contexts: for instance, graphical models, topic modeling or biologi¬ cal applications. Their popularity is mainly due to the high flexibility in density estimation problems as well as in clustering, which is naturally embedded in the model. Often the Dirichlet Process prior is employed as mixing measure because of its mathe¬ matical and computational tractability: however, in some statistical applications, clustering induced by the Dirichlet process may be restrictive. In fact, it is well-know that the lat¬ ter allocates observations to clusters with probabilities depending only on the cluster sizes, leading to the "the rich gets richer" behavior. Within some classes of more general processes, as, for instance, stick-breaking and normalized processes, the probability of allocating an ob¬ servation to a specific cluster depends also on extra parameters, as well as on the number of groups and on the cluster's size. We refer to Argiento et al. (2015) for a recent review of state of art on Bayesian nonparametric mixture models and clustering. Since, when dealing with nonparametric mixtures, the posterior inference involves an infinite-dimensional parameter, this may lead to computational issues; this limit prevents applied statisticians from exploiting models beyond Dirichlet process mixtures when deal¬ ing with modern real-life applications. However, there is a recent and lively literature focus¬ ing mainly on two different classes of MCMC algorithms, namely marginal and conditional Gibbs samplers. The former integrate out the infinite dimensional parameter (i.e. the ran¬ dom probability), resorting to generalized Polya urn schemes; see Favaro and Teh (2013) or Lomeli et al. (2014). The latter include the nonparametric mixing measure in the state space of the Gibbs sampler, updating it as a component of the algorithm; this group includes the slice sampler (see Griffin and Walker, 2011). Among conditional algorithms there are trun¬ cation methods, where the infinite parameter (i.e. the mixing measure) is approximated by truncation of the infinite sums defining the process, either a posteriori (Argiento et al., 2010; Barrios et al., 2013) or a priori (Argiento et al., 2015; Griffin, 2013). In this work we introduce an almost surely finite dimensional class of random proba¬ bility measures that approximates the wide family of homogeneous normalized completely random measures (Regazzini et al., 2003; Kingman, 1975); we use this class as the building block in mixture models and provide a simple but general algorithm to perform posterior inference. Our approximation is based on the constructive definition of the weights of the completely random measure as the points of a Poisson process on JR h . In particular, we consider only points larger than a threshold e, controlling the degree of approximation. The construction given here generalizes Argiento et al. (2015) where the particular class of nor¬ malized generalized gamma processes was considered. Conditionally on e, our process is e-NormCRM mixtures 3 finite dimensional either a priori and a posteriori. As detailed later, the two main ingredients to build a normalized completely random measure are p(s), s > 0, the intensity of the Poisson process determining the weights of the measure on the one hand, and the so-called centering measure Pq(-), characterizing the locations of the measure, on the other. Here we illustrate two applications. In the first, a new choice for p is proposed: the Bessel intensity function, that, up to our knowledge, has never been applied in a statistical framework, but in finance (see Barndorff-Nielsen, 2000, for instance). On the other hand, we fix the centering measure Po to be the normal inverse-gamma, a conjugate choice when the kernel is Gaussian. We call this new process normalized Bessel random measure. In the second application, we set p to be the well- known generalized gamma intensity and consider a centering measure Pox depending on on a set of covariates x, yielding a linear dependent normalized completely random measure. For a recent survey on dependent nonparametric processes in the Statistics and Machine Learning literature see Foti and Williamson (2015). In this paper, since the main objective is the approximation of the nonparametric process arisen from the normalization of completely random measures, we fix e to a small value. However, it is worth mentioning that it is possible to elicit a prior for e, but the computa¬ tional cost might greatly increase for some p. The main achievements of this works can be summarized as follows: first we show that, for e going to zero, the finite dimensional e-approximation of homogeneous normalized completely random measures converges to its infinite dimensional counterpart, and com¬ pute its prior moments (Sections 3 and 4). Then we provide a Gibbs sampler for the e -approximation hierarchical mixture model (Section 5). Section 6.1 is devoted to the intro¬ duction of the normalized normalized Bessel random measure, and some of its properties; on the other hand. Section 6.2 discusses an application of the e-Bessel mixture models to both simulated and real data. Section 7 defines the linear dependent e-NGGs, and consider linear dependent e-NGG mixtures to fit the AIS data set. To complete the set-up of the paper. Section 2 is devoted to a summary of basic notions about homogeneous NRMIs, and Section 8 contains a conclusive discussion. 2 Preliminaries on homogeneous normalized completely random measures Let us briefly recall the definition of a homogeneous normalized completely random mea¬ sure. Let 0 C Ift m for some positive integer m. A random measure // on 0 is completely ran¬ dom if for any finite sequence B\, E> 2 , ■ ■., of disjoint sets in 13(0), p(Bi), p(B 2 ),..., p(Bk ) e-NormCRM mixtures 4 are independent. A purely atomic completely random measure is defined (see Kingman, 1993, Section 8.2) by p(-) = Ylj>i Jj (•)/ where the {{Jj, +])}f>\ are the points of a Poisson process on M + x 0. We denote by v{ds , dr) the intensity of the mean measure of such a Pois¬ son process. A completely random measure is homogeneous if v{ds, dr) = p(s)dsnPo(dT), where p(s) is the density of a non-negative measure on M + , while kPq is a finite measure on 0 with total mass k > 0. If p is homogeneous, the support points, that is {r ; }, and the jumps of p, {Jj}, are independent, and the r/s are independent identically distributed (iid) random variables from Pq, while { Jj} are the points of a Poisson process on M + with mean intensity p. Furthermore, we assume that p satisfies the following regularity conditions: /*+OO /»+OO (1) / min{l, s}p(s)ds < oo and / p(s)ds = + oo, Jo Jo so that, if T := p(0) = Ylj>i Jj' 1P(0 < T < +oo) = 1. Recall that the distribution of T is uniquely determined by its Laplace transform, given by: r+oo (2) E(e -Ar ) = exp {— k / (1 — e~ Xs )p(s)d-s}, A > 0. Jo Therefore, a random probability measure (r.p.m.) P can be defined through normalization of p: ( 3 ) +00 E % • 3 =1 We refer to P in (3) as a (homogeneous) normalized completely random measure with pa¬ rameter (p, kPq). As an alternative notation, following James et al. (2009), P is referred as a homogeneous normalized measure with independent increments. The definition of normal¬ ized completely random measures appeared in Regazzini et al. (2003) first. An alternative construction of normalized completely random measures can be given in terms of Poisson- Kingman models as in Pitman (2003). 3 ^-approximation of normalized completely random measures The goal of this section is the definition of a finite dimensional random probability measure that is an approximation of a general normalized completely random measure with Levy's intensity given by u(ds, dr) = p(ds)Kp)[dT), introduced above. First of all, by the Restriction Theorem for Poisson processes, for any e > 0, all the jumps {Jj} of p larger than a threshold £ are still a Poisson process, with mean intensity 7 S (s) := Kp(s)I( £j+0O )(s). Moreover, the total number of these points is Poisson distributed, i.e. AL T’o(Ae) where r+oo A £ := 7 e(M + ) =K p(s)ds. e-NormCRM mixtures 5 Since A £ < +oo for any e > 0 thanks to the regularity conditions (1), N e is almost surely finite. In addition, conditionally to N e , the points { J\,..., Jj \ e } are iid from the density (4) Pe(s) 7e(g) a £ A e ^(e,+oo)(®)j thanks to the relationship between Poisson and Bernoulli processes; see, for instance. King- man (1993), Section 2.4. However, in this case, while PfX^i Jj < °°) = 1/ the condition on the right hand side of ( 1 ) is not satisfied, so that P(X/J=i Jj = 0) > 0, or, in other terms, P (N e = 0) > 0 for any e > 0. For this reason we consider N e + 1 iid points {Jo, J \,..., Jn s } from p £ and define the completely random measure p £ {-) = J?A, (■)/ as well as its normalized counterpart: N e (5) p =(') = E p A 3=0 Ji N e = y —a . T 3 3=0 ±e (•), where 77 = J2f=oJj' T j ~ Po> { T j} and {Jj} independent. We denote P £ in (5) by e- NormCRM and write P £ ~ £—NormCRM(p,nPo). Whenp £ (s) = l/(w°T(— a, uj£))s~ a ^ 1 e~ uls , s > e, P £ is the e-NGG process introduced in Argiento et al. (2015), with parameter (a, n, Pq), 0 < a < 1 , k > 0 . Both the infinite and finite dimensional processes defined in (3) and (5), respectively, belong to the wide class of species sampling models, deeply investigated in Pitman (1996), and we use some of the results there to derive ours. Let ( 6 1 ,..., 0 n ) be a sample from (3) or (5) (or more generally, from a species sampling model); since it is a sample from a discrete probability, it induces a random partition p n := {C\ ,..., 67 , } on the set N n := {1,..., n} where Cj = {i : 0 t = 0* } for j = I..... k. If 4{C r = ri t for 1 < i < k, the marginal law of (0i,...,0 n ) has unique characterization: k C{p n ,0l,... ,91) = p(ni,... ,n k )Y[£(9j), 3 = 1 where p is the exchangeable partition probability function (eppf) associated to the random probability. The eppf p is a probability law on the set of the partitions of N n . The following proposition provides an expression for the eppf of a general e-NormCRM. Proposition 1. Let (m ,..., n^) be a vector of positive integers such that Yli=i n * = n ■ Then, the eppf associated with a P £ ~ e-NormCRM(p, kPq) is u n 1 {k + A £ , u ) (Ae u _ Ae) r(n) A £ Ks ni e us p(-s)ds ( 6 ) p £ (m . ,n k ) du e-NormCRM mixtures 6 ivhere ( 7 ) r+oo A-e,u • — := K S p(s)ds, u > 0. Proof. We have +oo A*« p E (ni,...,n k ) = Y Pe{ni,... ,n k \N £ )^-je -A e N e =0 since N e ~ Poi( A e ). Then, equation (30) in Pitman (1996) yields p £ (m,...,n k \N £ ) jv e +i}(*0 Y ji,-,jk \i=l / where the vector {j \,..., j k ) ranges over all permutations of k elements in {0,, N £ }. Then, using the gamma function identity, 1/T e n = J 0 +o ° 1/T (n)u n ~ 1 e~ uTe du, we have: n k JTti p £ (m,..,n k \N e )= i {li „ m>Ne+ iy(k) y *=i = I JV £ +1 }(fc) ^ r+OO du 31 1 " r+oo r+oo \ II / e- J ‘ u Pc(Jj)dJj 31,-,3k} ° / Now, by the definition of p £ in (4) and adopting the notation p(s) := np{s), it is straightfor¬ ward to see that Pe{ni,..,n k \N £ ) = I{i,...,jv e+ i}(fc) Y P+OO du 1 31,—-,3k r(n) ■n 1=1 ' r+oo A, / +oo 3<t\3l,-,3kt -JjU P(Jj \ dJj A, J By (8), we have +oo p £ (n 1 ,...,n k ) = Y hi,-,N £ +i}(k) Y r+oo du u"” 1 1 N e = 0 31,-,3k o \T(n) Af ^ k P+OO II J ^N e -\-l—k n .+ {+..//, 1 p+oo +oo —A e /*+oo f „,n-l / p+oo A^ e =0 y 0 h e~ JjU PiJjWj N e +l-k Yi e -^ N £ \ e~ JjU p(J j )dJ j / K r+oo \ x E (11/ jdu. e-NormCRM mixtures 7 Denoting by N na := N e + 1 — k the number of non-allocated jumps, we get Pe(ni,...,n k ) = ,+oo j' u n -1 g—A e g jV £ + 1 \r(n) A e 2^ o (N £ + l-k)\ {1 ’->^ +1 }(*) a +oo \ iV e +l—fc / k i>+oo \ 'j e~ JjU p(J j )dJ j j f J] J \du r+oo ^n—1 g—A e ^ /*+oo +oo /o r (n) A s n / d^e~ Jj i u p(Jj i )dJj i £ Afr^i^Ir N na =0 V £,U at | 1 {l,...,A r e + l j V TiV7 • }(fc) du. Since the last summation adds up to e Ae -" (A s . n + k), the p e (ni,..., n k ) and the right hand- side of (6) coincide. □ A result concerning the eppf of a generic normalized (homogeneous) completely random measure can be readily obtained from Pitman (2003), formulas (36)-(37): r+oo „,n— 1 , (r+oo \ (9) p( ni , ...,n k ) = —e+ °° {e ~ US ~ lMs)ds ( ]I J o Ks ni e~ us p(s)dsj du. Now we are ready to show that the eppf of (5) converges pointwise to that of the corre¬ sponding (homogeneous) normalized completely random measure (3) when e —>• 0. Proposition 2. Let p £ f) be the eppf of a £—NormCRM(p, kPq). Then for any sequence n\,..., n k of positive integers with k > 0 and ffn=i n i = n > (10) lim p e (n \, ...,n k ) = p 0 +i ,.. .,n k ), £->0 where p o(-) is the eppf of the NormCRM(p, kPq) as in (9). Proof. By Proposition 1 where r+oo Pe{ni, ■ ■ ■ ,n k ) = / fe(u;ni,... ,n k )du Jo (11) fe(u-ni,...,n k ) = ^— e (A s ,u f ns ni e us p(s)ds, u> 0 r W A £ f = , Je On the other hand, the eppf of a NormCRM(p, kPo) can be written as f+oo Po(ni,...,n k ) = / /o(u;ni,... ,n k )du, where u n ~ 1 f r+oo 'i f r+oo fo(u] ni,, n k ) = p^y exp y (e _ “ s - l)p(s)ds| J n.s ni e~ us p(s), u> 0. e-NormCRM mixtures 8 We first show that (12) limf £ (u;n 1 ,...,n k ) = f(u;n 1 ,...,n k ) for any u > 0. £—►0 In particular, we have that O+OO lim £—^0 s ni e- us p(s)ds s ni e- us p(s)ds and r»+oo lim e Ae ’ u Ae = exp | k J (e us — l)p(s)ds| , being this limit finite for any u > 0. Using standard integrability criteria, it is straightfor¬ ward to check that, for any u > 0, linv^o A £;U = lim e _>o A e = +oo and they are equivalent infinite, i.e. k + A £ u A £ u 1 lim--- = lim - = 1. £- ^0 i\.£ £- ^0 i\.£ We can therefore conclude that (12) holds true. The rest of the proof follows as in the second part of the proof of Lemma 2 in Argiento et al. (2015), where we prove that (i) lim £ ^ 0 Xcen„ Ve{n\, ■ ■ ■, n k ) = 1; (ii) lim inf e _».o Pe(«i, ■ ■ ■, n k ) Po(n\,..., n k ) for all C = (Ci, ..., C k ) G IT n , the set of all partitions of {1,2,..., n}; (in) Xcgn„ To(ui,..., n k ) = 1. By Lemma 1 in Argiento et al. (2015), equation (10) follows. □ Convergence of the sequence of eppfs yields convergence of the sequences of e-NormCRMs, generalizing a result obtained for e-NGG processes. Proposition 3. Let P £ be a e-NormCRM(p , nPo),for any e > 0. Then P e 4P(!S£->0, where P is a NormCRM(p , kPq). Moreover, as e —t +oo, P £ -4 S T0 , where r 0 ~ Pq. Proof Since P £ is a proper species sampling model, p £ defines a probability law on the sets of all partitions of {1,..., n}, for any positive integer n; let (TVf,.... Nf) denote the sizes of the blocks (in order of appearance) of the random partition GV.n defined by p £/ for any e > 0. The probability distributions of {(A r f ,... ,N^),e > 0} are proportional to the values of p £ (for any e > 0) in (2.6) in Pitman (2006). Hence, by Proposition 2, for any k = 1 , ... ,n and any n, (Aff, • • •, N k ) —> (A} 5 ,..., N®) as e —> 0, e-NormCRM mixtures 9 where (N ®,..., N®) denote the sizes of the blocks of the random partition C £ , n defined by Po, the eppf of a NormCRM(p, kPo) process. By formula (2.30) in Pitman (2006), we have d n n—>•+oo > (p?) £—^0 Nl n n—>•+oo > (Pi) where PJ and Pj are the j-th weights of a c-NormCRM and a NormCRM process (with parameters (p, kPq)), respectively. We prove that Ej>o Pf^' •' ' •' n—H-oo * 'Dj]> 0 Pj ^ T j ’ where to, ti, 12 ,... are iid from Po and this ends the first part of the Proposition. Convergence as e —>• +00 is straightforward as well. In fact, when e increases to + 00 , there are no jumps to consider in (5) but the extra Jo, so that P £ degenerates on d TQ . □ Let 6 = ( 61 ,..., 9 n ) be a sample from P e , a e-NormCRM(p, kPo ) as defined in (5), and let d* = ( 6 *,... ,0%) be the (observed) distinct values in 6 . We denote by allocated jumps of the process the values P;*, P;*,..., Pi* in (5) such that there exists a corresponding location for which T t * = 0 *,i=l,...,k. The remaining values are non-allocated jumps. We use the superscript ( na ) for random variables related to non-allocated jumps. We also introduce the random variable U := T n /T e , where r„ ~ gamma(n, 1), being T n and T e independent. Proposition 4. If P e is an e—NormCRM(p, hPq), then the conditional distribution of P s , given 6 * and U = u, verifies the distributional equation k PU-) = w+ (1 - W) r, w * ; (0 3 = 1 where 1. pDfD (•), the process of non-allocated jumps, is distributed as a e—NormCRM(e u 'p {-), kPq), given that exactly N na jumps of the process were obtained, and the posterior law of N na is A-e,u k + A £jU Pi(A £ jU ) + k + A £)U -Po(A-e,v being A £)U as defined in (7), and denoting VfX) the shifted Poisson distribution on {i, i+ 1, * + 2,...} with mean i + A, i = 0,1; 2. the allocated jumps {p[ a \ ..., pjf' 1 } associated to the fixed points of discontinuity G* = (PJ,..., 01) of P* are obtained by normalization of J? i e~ ujj e~ ujj p(Jj)I( e i+00 )(Jj), for j = 1 • • •, k; e-NormCRM mixtures 10 3. Pe™u\-) and { j["\ • • • , J^} are independent, conditionally to l* = (/*,..., If), the vector of locations of the allocated jumps; 4. w is defined as 0 when N na = 0, otherwise w = T £tU /(T £ ^ u + Ylj =l )■ T £i „ is the total sum of the jumps in representation of P £ ff\-) as in (5); 5. the posterior law ofU given G* has density on the positive real given by A i U k r +oo fu\9*(u\0*) oc u n - 1 e A '‘ u ~ A ' e ’ u I] / Acs ni e- us p(s)ds, u > 0 . This proposition is the "finite dimensional" counterpart of Theorem 1 in James et al. (2009). Proof. The first steps of the proof are the same as in the proof of Proposition 2 in Argiento et al. (2015); in particular, the joint law of 6, u, P £ , £(G, u, P £ ), is as in (16) in Argiento et al. (2015). The conditional distribution of P £ , given U = u and 0, is as follows: (13) jC(P £ \u,G) = £{t,J,N £ \u,G) = jC(t,J\N £ ,u,G)jC{N £ \u,G), where the second factor in the right hand side is proportional to C(N £ ,u, G) = J dJo ... dJN e dro ... dTN e C(r, J, N e ,u, G) £ {[n [ PM*)Mni)dJ n dr n I!.5 l 4= f e~ uJj p £ (Jj)Po(Tj)dJjdr n* J ( r k n, —uJi* —uJj £ n / g ip-,i* k 1 v n - l e~^ A ^‘ T(n N e \ e ~ aui te~ UJl *i p (J lt ) X ~ dJi*P 0 (9i* n \ \ u1X 1 -A e ^fjt_ A £ )J T(n) N e l ,71— 1 Af e U J. v £ rR e X £ (AnX i=1 u - 1 _A e A ^ +1_fc (A^e + 1) T(n) A, (N e + 1 - r+OO \ ^JV e + l —fe ^e-™ P (s)ds) +OO ns ni e~ us p(s)ds ). We have already introduced in this paper N na = N e + 1 — k, the number of non-allocated jumps. Of course, the conditional distribution £(N e \u, G) in (13) is identified by C(N na \u, G), e-NormCRM mixtures 11 which can be derived as (14) £(N na \u, 6) (x £(N na , u, 6) oc A ; A, N na ( N na + k) na £,U N na \ k „-A e.ul ll£ ’ u A N na -1 , * \N na nr Ke ' CJV M - i)! A «-“ + ivj A '.“ )*<«"“«> A„ oc -'Pi (Nna, A £)U ) + k -Vo(N na ] A £jU ). Ku+k v ’ ' Ku+k On the other hand, the first factor in the right hand side of (13) can be computed by intro¬ ducing l* = (l *,..., l* k ), the vector of indexes of the allocated jumps and by observing that the augmented right hand side of (13) N na -\-k —1 C(J,T,l*\Nna,U,0) = IT Pe(Jj)Po(Tj)e 1 ‘1 k l k , , uJ i 3=0 (15) 11' Ci=1 1 rii np(Ji* e uJ lV _r^ )p n e -uj 0 ^Mp Q{Tj) \ N e +1 (iL 7 ? 6 UJl ^p(Jit)^Po(ri t )) [ n e~ uJj np(Jj)Po{jj ) j . v *=i ! 1 J W/ The first factor in the last expression refers to the unnormalized allocated process: the support is 9*. This shows point 2. of the Proposition. Therefore, the conditional distribution of P e is proportional to the following expression: ( IT e~ uJj Kp(Jj )Pq (tj ) j H=1 ' 1 7 +1 ~ k —a (N £ + 1) X A, 6 (N e + 1- s ni ne~ us p(s)ds. This yields points l.,3. and 4. of the Proposition. To show point 5., we need to integrate out N e from C(N e ,u, 6 ); we have: +oo +°° 7/ n— 1 ^Ne + l — k 7U 1 k r+oo C{u\9) oc Y, P( N e,u,d) = Y Y(nj e ~ Ae \ (N e + 1 - k)\ J[l p ^ ds N e =0 N e =0 „,n -1 +°° \Nna AT I U / k f+OO -»• V %^%±A(n / Ks ni e~ us p(s)ds T(n N na = 0 A e A t na \ T(n This ends the proof. A, IT / Ks ni e us p(s)ds i= 1 ' □ e-NormCRM mixtures 12 4 Prior moments of P £ Before deriving the first two moments of P £ , let us mention that the expect value and vari¬ ance of N e , the number of jumps considered in the approximation P £ , depend on the prior of e. Of course, if e is assumed fixed, IEfAO) = Var(Af £ ) = A £ < +oo, while, if e is random, then E(JV e ) = E((JV e |e)) = E(A £ ), Var(AT £ ) = Var(A £ ) + E(A e ). In this case, the mean and variance of N £ are not necessarily finite; see, for instance. Ta¬ ble 2 in Argiento et al. (2015), where P £ is the e-NGG process, and for some values of its hyperparameters the mean or the variance of N e are infinite. First of all, observe that (16) {xi + + «,.)”•= £ miH- =m >0 m m i,, ,m N > N* n 3 =1 lib ^ = E'fi.«;>(*% E n\-\ - 1 -rik=m rij= 1,2,... n- '3 where N* = N e + 1, Xj = 1 for all x 3 > 0, and the last summation is over all positive integers, being (16) the multinomial theorem. The second equality follows straightforward from different identifications of the set of all partitions of m (see Pitman, 2006, Section 1.2). Therefore, for any B € 13(0), m = 1,2,..., we have (here, instead of P, and To as in (5), there are Pn* and tn*): N* E(P e (B) m ) = E j E ( (J2 P 3 6 rj( B )) m \N £ 3=1 ( = E / E \ E rn miH- \-m N *=m v ’ Jv e' j=l N* H(P 3 5 Tj (B)ry\N £ V V >0 J ( = E = E / E 1 \ E%.E vni n . K - ni+ ...+n k =m ' > n fc nj= i,2,... k=\ m E ) \n £ <31, —Jk i= l ( \ k =1 niH-| -nk=m ■n j = 1,2,... m n\ , • • •, fik E E (II P ji\ N z) IT ( B )\ N e 31,-,3k i= l 1=1 e-NormCRM mixtures 13 = E m ^ £*e.«?)(*% £ k= 1 niH-(- nk=m n.j =1,2,... m 111, . . • , Tlk Pe(ni, ■ ■ ■ ,n k )(P 0 (B)) k We identify this last expression as E^P 0 (H) fc P(/i m = fe|iV e ) N ) , \k =1 where K m is the number of distinct values in a sample of size in from P £ . Hence, we have proved that E (P £ (B) m ) = E ( E(P 0 (B) k ™\N £ )) = E ( P 0 (B ) Km ) . In particular, when m = 2, K m assumes value in {1, 2}, and the probability that K 2 = 1 is the probability that, in a sample of size 2 from P e , the samples values coincide, i.e. p £ ( 2). Therefore E(P e (B) 2 ) = P 0 (B)p £ (2 ) + (P 0 (B)) 2 (1 -p e (2)), and consequently (17) Var (P £ (B)) = P 0 (B)p £ (2) + P 0 (B) 2 ( 1 - p £ ( 2)) + P 0 (B) 2 = p £ (2)P 0 (B) (1 - P 0 (B)). Analogously, suppose that B i, B 2 € 13(0) are disjoint. Therefore N* N* E(P e (B 1 )P e (B 2 ))=E M £%<£!)£ PiS n (B 2 )\N e .1=1 /=i )) / = E = E = E / E N* \ Y PjS Tj (B 1 n B 2 ) + Y ^) S n ( p 2 )\Ne 3 = 1 1+3 \ \ \ ZA? \H=i,-,A / P 0 {B 1 )P 0 {B 2 ) Y ^ P 3 P l\ N e) V 1+3 j-l I.V = Po(Pi)P 0 (P 2 )p e (l,l). y The general case when /i| and /i 2 are not disjoint follows easily: E(P £ ( J B 1 )P e (5 2 )) = E ((P e (Pi n S 2 )) 2 ) + E (P e (Pi \ P 2 )P £ (Pi D P 2 )) + E (P £ (P 2 \ Pi)P e (Pi n P 2 )) + E(P e (P! \ B 2 )P £ (B 2 \ Pi)), e-NormCRM mixtures 14 where now the sets are disjoint. Applying the result above we first find that E(P e (S!)P e (S 2 )) = Pe(2)P 0 (B 1 n B 2 ) + (1 - p £ (2))P 0 (B 1 )P 0 (B 2 ), and consequently: Cov((P e (B 1 ), P £ (B 2 )) = p £ ( 2) (P 0 (Si n P 2 ) - P 0 (Si)fb(S 2 )) • 5 e-NormCRM process mixtures Among the wide range of applications in which discrete random probability measures are exploited, hierarchical mixture models, dating back to Lo (1984), are frequently used when dealing with various data structures. Hence, as argued in the Introduction, their role is becoming more and more central in modern Bayesian Nonparametrics. We consider mix¬ tures of parametric kernels as the distribution of data, where the mixing measure is the e-NormCRM(p, kPq). The model we assume is the following: Yi\9i ~ d i = l,...,n Oi I P £ ~ P e , i = l,...,n (18) 1 P £ ~ £ — NormCRM ( p , kPq), £ ~ vr(e), where /(•: 9j) is a parametric family of densities on ¥ C M p , for all 0 G 0 C M m . Remember that /}) is a non-atomic probability measure on 0, such that E(P £ (A)) = P\i(A) for all A € B(&) and all e > 0. Model (18) will be addressed here as e—NormCRM hierarchical mixture model. It is well known that this model is equivalent to assume that the Y/s, conditionally on P £ , are independently distributed according to the random density , N e h(y)= / f(y,8)P £ (dd) = YP J Je j=o In particular, we are able to build a blocked Gibbs sampler to update blocks of parameters, which are drawn from multivariate distributions. The parameter is (P e , e, 6), but we use the augmentation trick prescribed by the posterior characterization in Proposition 4, so that the new parameter is (P £ ,s, 0, u); the joint law of e-NormCRM mixtures 15 data and parameters can be written as follows: n C(Y , e , u, P e ,e) = C(Y\0, u, P £ ,£)C(0, u, P £ \e)C(e) = f(Y t : 0i)C(9, u, P £ \e)7 r(e) 2—1 (19) u n— 1 r(n) 3=0 i&C 1 x J % n 7r(£) ieCfc where we used the hierarchical structure in (18). The Gibbs sampler generalizes that one provided in Argiento et al. (2015) for e-NGG mixtures. Description of the full-conditionals is below, and further details can be found in the Appendix. 1. Sampling from C(u\Y, 6 , P £ ,e): from (19) it is easy to see that the factors depending on u identify this full-conditional as gamma with parameters (n, T e ), like the corre¬ sponding prior. 2. Sampling from C(0\u,Y, P s ,e): each 0 lr for i = 1,..., n, has discrete law with support {t 0 , n,..., T Ne }, and probabilities P(0j = tj) oc J } f{Y l \ tj). 3. Sampling from C{P £ , e\u, 0. Y): this step is not straightforward and can be split into two consecutive substeps: 3.a Sampling from C{e\u, 6, Y): see the Appendix. 3.b Sampling from C(P e \e, u. 0. Y)\ via characterization of the posterior in Propo¬ sition 4, since this distribution is equal to C{P e \s, u, 6). To put into practice, we have to sample (i) the number N na of non-allocated jumps, (ii) the vector of the unnormalized non-allocated jumps J^ na \ (Hi) the vector of the unnormalized allo¬ cated jumps J^ a \ the support of the allocated (■ iv ) and non-allocated (v) jumps. See the Appendix for a wider description. Remember that, when sampling from non-standard distributions, Accept-Reject or Metropolis- Hastings algorithms have been exploited. 6 Normalized Bessel random measure mixtures: an application to density estimation In this section we introduce a new normalized process, called normalized Bessel random measure, corresponding to a specific choice for the intensity function p(-). Section 6.1 de- e-NormCRM mixtures 16 scribes theoretical results: in particular, we show that this family encompasses the well- known Dirichlet process. Then we fit the mixture model to synthetic and real datasets in Section 6.2. Results are illustrated through a density estimation problem. 6.1 Definition Let us consider a normalized completely random measure corresponding to mean intensity p(s;uj) = -e _a;s /o(s), s > 0, s where ui > 1 and /,(») = £ ,1 /2 | m= 0 v ' is the modified Bessel function of order v > 0 (see Erdelyi et al., 1953, Sect 7.2.2). It is straightforward to see that, for s > 0, 1 + °° 1 ( 20 > = r"“ + E 2^!p s2 ”- le "“' m= 1 v ’ so that p is the sum of the Levy intensity of the gamma process with rate parameter uj and of the Levy intensities (21) Pm(s-,uj) = 22m( 1 m , )2 g 2m - 1 e- a,a , s>0, m = l,2,... corresponding to finite activity Poisson processes. It is simple to check that (1) holds. Hence, following (3) in Section 2, we introduce the normalized Bessel random measure P, with parame¬ ters (u, k), where ui > 1 and n > 0. Thanks to (20) and the Superposition Property of Poisson processes, in this case, the total mass T in (3) can be written as +oo (22) T = T g +Y^ T m , m= 1 where Tq, T), T2, ... are independent random variables, Tq being the total mass of the gamma process and T m the total mass of a completely random measure corresponding to the intensity u. m {ds,dr) = p m {s)dsKP{ i (dT). In particular, Tq ~ gamma(K,w), while T m = Ylf=o > w h ere ~ Poi(i\T{2m)/({2(jj)‘ 2rn (m\) 2 )), and are the points of a Poisson process on M + with intensity np m . By this notation we mean that T m is equal to 0 when N m = 0, while, conditionally to N m > 0, ~ gamma(2m,u). We can write down e-NormCRM mixtures 17 the density function of T, via (2): V’(A) := — log ^E(e AT )^ = K J (1 — e Xs )p{s;uj)ds 0 /*+OO ] +oo -i /*+oo (1 -e- Xs )-e~ us ds+y _ , / (1 — e _As )s o 2 2 -(m!) 2 7o - A ^„2m-l e - WS& = « ^log = « ^log = AC log T(2m) u; + A\ ^ T(2m) ^ a; J 2-/ 2 2m (m\) 2 u> rn ^ 2 2rn (m\) 2 (ui + A) ' m— 1 v 7 m=l 7 cc + A cc 1 1 2 2 1 1 -log O + o V 1 ~ “3 + log n + nt 1“ 2 2 (w + A) 5 LO + A + \f [ui + A) 2 — 1 + Vw 2 -1 The same expression is obtained when T frit) = k{uj + \/lo 2 — 1) K ^ I K (t), t > 0 (see Gradshteyn and Ryzhik, 2007, formula (17.13.112)). Observe that, when lo = 1, fx is called Bessel function density (Feller, 1971). By (9), the eppf of the normalized Bessel random mea¬ sure is: (23) where p B (n 1 ,...,n k ]io,K) = n k J o + 00 yU— 1 LO + Vlo 2 -1 1 r(n) ycc + u + ^ (lo + u) 2 — 1 J (u + u) T 1 r . . (71 j Ti j -|“1 1 xri r ( n i) 2Fi 3 =1 ('U + Lo ) 2 du, TP ( \ SD ( a l)m ( a2 )m 1 / \m , T(a + m) 2 F 1 (ai,a 2 -,r,z ) := 2 ^ -73-“37 (*) , with (a) m := m=0 (7)r m\ T(a) is the hypergeometric series (see Gradshteyn and Ryzhik, 2007, formula (9.100)). The following proposition shows that the eppf of the normalized Bessel random measure converges to the eppf of the Dirichlet process as the parameter to increases. Proposition 5. Let (m,... ,n k ) be a vector of positive integers such that Yli=i = n > where k = 1 ,n. Then, the eppf (23), associated with the normalized Bessel random measure P with parameter (co, ac), lo > 1, ac > 0, and mean measure Pq, is such that lim p B (ni,n k ; w, ac) = p D (ni,n k ; ac), uj— >-+oo where pD(ni, • • •, n k ; ac) is the eppf of the Dirichlet process with measure parameter kPq. e-NormCRM mixtures 18 Proof. The eppf of the Dirichlet process appeared first in Antoniak (1974) (see Pitman, 1996); anyhow, it is straightforward to derive it from (9): PD(m,... ,n k ;n) = / Jo +oo ^n-1 ^n« , r(n -’> du (U + u) n 3 3 = 1 r+oo u n— 1 T(ra) \co + uJ (u T w) n ^ 1 k T(/tTn) ' . -f- .7 = 1 n r <’ where the last equality follows from formula (3.194.3) in Gradshteyn and Ryzhik (2007). By definition of the hypergeometric function, we have 1 < 2-^1 n 1 rij + 1 1 \ _ T ,_ 2~ ;1 ’ (u + ujY ) ~ 2Fl m n J + 1 .i. 1 2 ’ 2 ’ ’ J- Moreover u + vka 2 — 1 _ uj 1 + — 1/ta 2 (u + 0j) + y/l(u + Lu) 2 - 1) U + U 1 + y/l - l/(ti + w )2 1 + -\/l ~ l/^i 2 < 1 + \/l ~ 1/ia 2 ^ 1 2 _ 1 + y 7 ! - l/(tt + w) 2 _ so that 1 + \J 1 — 1/uP PD(ni, ...,n k -K)< p B (ni,. .. ,n k ;uj,K ) n ^ Tlj Tlj 1 1 \ , , 2G1 I Y J 2 — ;1; ^2 JPz?(ni,...,n fe ;K). f=i ' “ ^ 7 The left hand-side of these inequalities obviously converges to pd(m, • • -, n k : n) as uj goes to Too. On the other hand. "j + 1 .l. 1 2 ’ 2 ’ ’ ca 2 1 clS UJ —V “bOO, thanks to the uniform convergence of the hypergeometric series 2 1 7 ! (^f, " J 2 ' : 1; z) on a disk of radius smaller that 1. We conclude that, for any ni,... ,n k such that n\ + • • • T n k = n, k = 1,. ,. ,n, and any k > 0, lim p B (ni,... ,n k -,uj,K) = p D (ni,... ,n k \ k). LJ —>-+00 Since the eppf is the joint distribution of the number K n of distinct values and corre¬ sponding sizes N\,... ,N k (see equation (30) in Pitman (1996)) in a sample of size n from the normalized Bessel completely random measure, by marginalization we obtain P(K n = k) = - j: 77/15 • • • 5 ^ k P B (jii, ..., n k ), k = 1,... ,n, e-NormCRM mixtures 19 where the sum is over all the compositions of n into k part, i.e., all positive integers such that n\ + ■ ■ ■ + ilk = n - Unfortunately, we were not able to simplify further this last expression, because of the summation of the hypergeometric functions 2 F 1 occurring in the analytic expression (23) of pb- Since the number of partitions of n items in k blocks can be very high (it is given by the Stirling number S(n; k) of the second kind) and the evaluation of 2 -F 1 computationally heavy, we prefer to use a Monte Carlo strategy to simulate from the prior of K n . The simulation strategy is also useful to understanding the meaning of the parameters of the normalized Bessel random measure: k has the usual interpretation of the mass parameter, since, when fixing u, E (K n ) increases with k. On the other hand, the effect of oj is quite peculiar: decreasing uj (thus drifting apart from the Dirichlet process), with k fixed, the prior distribution of K n shifts towards smaller values. However, when E (K n ) is kept fixed, the distribution has heavier tails if uj is small (see Figures 1 and 3 (a)). Figure 1: Prior distribution of K n under a sample from e-NB process with e = 10 - 6 ,w = 1.05 and several values for k, as reported in the legend. 6.2 Application In this section let us consider the hierarchical mixture model (18), where the mixing measure is P £ , the e-approximation of the normalized Bessel random measure, as introduced above (here e-NB(w, kPq) mixture model). Of course, when e is small, this model approximates the corresponding mixture when the mixing measure is P; to the best of our knowledge, this normalized Bessel completely random measure has never been considered in the Bayesian nonparametric literature. By decomposition (22), we argue that this model is suitable when e-NormCRM mixtures 20 the unknown density shows many different components, where a few of them are very spiky (they should correspond to Levy intensities (21)), while there is a folk of flatter components which are explained by the intensity (l/s)e~ ws of the Gamma process. For this reason, we consider a simulated dataset which is a sample from a mixture of 5 Gaussian distributions with means and standard deviations equal to {(15,1.1), (50,1), (20,4), (30, 5), (40, 5)}, and weights proportional to {10, 9,4, 5,5}. The histogram of the simulated data, for n = 1000, is reported in Figure 2. We report posterior estimates for different sets of hyperparameters of the e-NB mixture model when /(•; 0) is the Gaussian density on M and 6 = (p, a 2 ) stands for its mean and vari¬ ance. Moreover, Po(dp, da 2 ) = J\f(dp; y n , a 2 /kq) x inv — gamma(da 2 ; a, b); here Af(y n , v 2 /kq) is the Gaussian distribution with mean y n (the empirical mean) and variance a 2 /kq, and inv — gamma(d(j 2 ; a, b ) is the inverse-gamma distribution with mean b/(a — 1) (if a > 1). We set kq = 0.01, a = 2 and b = 1 as proposed first in Escobar and West (1995). We shed light on three sets of hyperparameters in order to understand sensitivity of the estimates under dif¬ ferent conditions of variability; indeed, each set has a different value of p £ ( 2), which tunes the a-priori variance of P £ , as reported in (17). We tested three different values for p £ ( 2): p e ( 2) = 0.9 in set ( A ), p £ ( 2) = 0.5 in set (B) and p £ ( 2) = 0.1 in set ( C ). Moreover, in each scenario we let the parameter 1/oj ranges in {0.01, 0.25, 0.5,0.75,0.95}; note that the extreme case of oj = 100 (or equivalently 1/a; = 0.01) corresponds to an approximation of the DPM model. The mass parameter k is then fixed to achieve the desired level of p £ ( 2). As far as the choice of e concerns, we set it equal to 1CT 6 : at the end, we got 15 tests, listed in Table 1. It is worth mentioning that it is possible to choose a prior for e, even if, for the p in (20), the computational cost would greatly increase due to the evaluation of functions 2 Pi in (23). We have implemented our Gibbs sampler in C++. All the tests in Sections 6 and 7 were made on a laptop with Intel Core i7 2670QM processor, with 6GB of RAM. Every run pro¬ duced a final sample size of 5000 iterations, after a thinning of 10 and an initial burn-in of 5000 iterations. Every time the convergence was checked by standard R package CODA tools. Here, we focus on density estimation: all the tests provide similar estimates, quite faithful to the true density. Figure 2 shows density estimate and pointwise 90% credibility intervals for case A5; the true density is superimposed as dashed line. Figure 3 (a) and (b) display prior and posterior distributions, respectively, of the number K n of groups, i.e. the number of unique values among (9 1 ,..., 9 n ) in (18) under two sets of hyperparameters, Al, representing an approximation of the DPM model, and A5, where the parameter u> is nearly 1. From Figure 3 it is clear that A5 is more flexible than Al: for case A5, a priori the variance of K n is larger, and, on the other hand, the posterior probability mass in 5 (the true value) is larger. e-NormCRM mixtures 21 CM 10 20 30 40 50 simulated data Figure 2: Density estimate for case A5: posterior mean (line), 90% pointwise credibility intervals (shadowed area), true density (dashed) and the histogram of simulated data. (a) (b) Figure 3: Prior (a) and posterior (b) distributions of the number K n of groups for test A1 (gray) and A5 (blue). In order to compare different priors, we take into account five different predictive good- ness-of-fit indexes: (i) the sum of squared errors (SSE) , i.e. the sum of the squared differ- e-NormCRM mixtures 22 ences between the y t and the predictive mean E (Yi\data) (yes, we are using data twice!); (ii) the sum of standardized absolute errors (SSAE), given by the sum of the standardized error | yi — E( Y t \ data) |/ y/ Var( Y,\data ); (in) log-pseudo marginal likelihood (LPML), quite standard in the Bayesian literature, defined as the sum of log (CPOi), where CPOi is the conditional predictive ordinate of y lr the value of the predictive distribution evaluated at y„ conditioning on the training sample given by all data except yi. The last two indexes, (iv) WAIC\ and (v) WAIC 2 , as denoted here, were proposed in Watanabe (2010) and deeply analyzed in Gelman et al. (2014): they are generalizations of the AIC, adding two types of penalization, both accounting for the "effective number of parameters". The bias correc¬ tion in WAIC\ is similar to the bias correction in the definition of the DIC, while W AIC 2 is the sum of the posterior variances of the conditional density of the data. See Gelman et al. (2014) for their precise definition. Table 1 shows the values of the five indexes for each test: the optimal (according to each index) tests are highlighted in bold for the experiments ( A ), (B) and ( C ). It is apparent that the different tests provide similar values of the indexes, but SSE, indicating that, from a predictive viewpoint, there are no significant differences among the priors. However, especially when the value of k is small, i.e. in all tests A and B, a model with a smaller oj tends to outperform the Dirichlet process case (approximately, when oj = 100). On the other hand, the SSE index shows quite different values among the tests: it is well-known that this is a index favoring complex models and leading to better results when data are over-fitted. Therefore, tests with an higher value of k are always preferable according to this criterion. We fitted our model also to a real dataset, the Hidalgo stamps data of Wilson (1983) consisting of n = 485 measurements of stamp thickness in millimeters (here multiplied by 10 3 ). The stamps have been printed between 1872 and 1874 on different paper types, see data histogram in Figure 4. This dataset has been analyzed by different authors in the context of mixture models: see, for instance, Izenman and Sommer (1988), McAuliffe et al. (2006) and Nieto-Barajas (2013). We report posterior inference for the set of hyperparameters which is most in agree¬ ment with our prior belief: the mean distribution is Po(dg, da 2 ) = Afidfi: y n . a 2 / kq) x inv — gamma(da 2 ; a, b) as before, and ko = 0.005, a = 2 and b = 0.1. The approximation param¬ eter s of the e-NB(w, kPI) random measure is fixed to 10 -6 ; on the other hand, in order to set parameters oj and k, we argue as follows: oj ranges in {1.05,5,10,1000} and we choose the mass parameter k such that the prior mean of the number of clusters, i.e. E (K n ), is the desired one. As noted in Section 6.1, a closed form of the prior distribution of K n is not available, so we resort to Monte Carlo simulation to estimate it. Table 2 shows the four couples of (oj, k) yielding E (K n ) = 7: indeed, according to Ishwaran and James (2002) and e-NormCRM mixtures 23 Table 1: Predictive goodness-of-fit indexes for the simulated dataset. Test cu K SSE SSAE WAIC1 WAIC2 LPML A1 100 0.06 6346.59 811.16 -3312.44 -3312.55 -3312.55 A2 4 0.09 5812.86 810.43 -3312.33 -3312.42 -3312.43 A3 2 0.1 6089.19 810.99 -3312.38 -3312.47 -3312.48 A4 1.33 0.11 6498.23 811.29 -3312.54 -3312.62 -3312.63 A5 1.05 0.11 5725.18 810.39 -3312.27 -3312.36 -3312.36 B1 100 0.43 5184.25 809.61 -3311.95 -3312 -3312.01 B2 4 0.67 5125.41 809.7 -3312.19 -3312.25 -3312.26 B3 2 0.81 4610.39 809.42 -3311.92 -3311.98 -3312 B4 1.33 0.93 4246.43 809.07 -3311.75 -3311.83 -3311.84 B5 1.05 1 4571.09 809.08 -3311.96 -3312.05 -3312.06 Cl 100 1.56 3707.5 809.36 -3311.73 -3311.86 -3311.88 C2 4 2.67 2194.1 808.8 -3312.02 -3312.23 -3312.26 C3 2 3.64 1223.86 809.28 -3312.62 -3312.96 -3312.99 C4 1.33 5.29 748.85 808.7 -3313.05 -3313.51 -3313.54 C5 1.05 8.95 685 807.96 -3312.9 -3313.36 -3313.38 e-NormCRM mixtures 24 (a) (b) Figure 4: Posterior inference for the Hidalgo stamp data for Test 4: histogram of the data, density estimate and 90% pointwise credibility intervals (a); posterior distribution of K n (b). McAuliffe et al. (2006) and references therein, there are at least 7 different groups (but the true number is unknown), corresponding to the number of types of paper used. For an in- depth discussion about the appropriate number of groups in Hidalgo stamps data, we refer the reader to Basford et al. (1997). Table 2 also reports prior standard deviations of K n : even if the a-priori differences are small, the posteriors appear to be quite different among the 4 tests. All the posterior distributions on K n support the conjecture of at least seven distinct modes in the data; in particular. Figure 4 (b) displays the posterior distribution of K n for Test 4. A modest amount of mass is given to less than 7 groups, and the mode is in 11. Even Test 1, corresponding to the Dirichlet process case, does not give mass to less than 7 groups, where 9 is the mode. Density estimates seem pretty good; an example is given in Figure 4 (a), with 90% credibility band for Test 4. As in the simulated data example, some predictive goodness-of-fit indexes are reported in Table 2: the optimal value for each index is indicated in bold. The SSE is significantly lower when u is small, thus suggesting a greater flexibility of the model with small values of c o. The other indexes assume the optimal value in Test 4 as well, even if those values are similar along the tests. e-NormCRM mixtures 25 Table 2: Predictive goodness-of-fit indexes for the Hidalgo stamps data. Test UJ K E(K n ) sd(K n ) SSE SSAE WAIC1 WAIC2 LPML 1 1000 0.98 7 2.04 15.17 384.1 -713.12 -713.96 -714.12 2 10 0.91 7 2.13 12.85 383.51 -713.22 -714.04 -714.25 3 5 0.92 7 2.18 13.52 383.68 -713.52 -714.3 -714.4 4 1.05 1.02 7 2.32 11.12 383.38 -712.84 -713.66 -714.05 7 Linear dependent NGG mixtures: an application to sports data Let us consider a regression problem, where the response Y is univariate and continuous, for ease of notation. We model the relationship (in distributional terms) between the vec¬ tor of covariates x = (x\,.... x p ) and the response Y through a mixture density, where the mixing measure is a collection {P x ,x € X} of e-NormCRMs, being X the space of all possible covariates. We follow the same approach as in MacEachern (1999), MacEachern (2000), De Iorio et al. (2009) for the dependent Dirichlet process. We define the dependent e-NormCRM process { P x . x € X}, conditionally to x, as: N e (24) 3=0 The weights Pj are the normalized jumps as in (5), while the locations 7 'j(x), j = 1,2,..., are independent stochastic processes with index set X and Pq x marginal distributions. Model (24) is such that, marginally, P x follows a e-NormCRM process, with parameter ip, kPq x ), where p is the intensity of a Poisson process on M + , n > 0, and Pq x is a probability on M. Observe that, since N e and Pj do not depend on x, (24) is a generalization of the single weights dependent Dirichlet process (see Barrientos et al., 2012, for this terminology). We also assume the functions x 7 j[x) to be continuous. The dependent e-NormCRM process in (24) takes into account the vector of covariates x only through 7 j{x). In particular, when the kernel of the mixture (18) belongs to the exponential family, for each j, 7 j(x) = p(x: Tj) can be assumed as the link function of a generalized linear model, so that (18) specializes to f(y;-y(xi,di )) i = l,...,n (25) iid di\P £ ~ P £ i = 1,..., n where P £ ~ e — NormCRM(p, kPq). This last formulation is convenient because it facilitates parameters interpretation as well as numerical posterior computation. e-NormCRM mixtures 26 We analyze the Australian Institute of Sport (AIS) data set (Cook and Weisberg, 1994), which consists of 11 physical measurements on 202 athletes (100 females and 102 males). Here the response is the lean body mass (lbm), while three covariates are considered, the red cell count (rcc), the height in cm (Ht) and the weight in Kg (Wt). The data set is con¬ tained in the R package DPpackage (Jara et al., 2011). The actual model (25) we consider here is when /(•; p, rj 2 ) is the Gaussian distribution with // mean and rj 2 variance; moreover, /j = 7 (x,0) = x t 6, and the mixing measure P e is the e-NGG(n, o. Pq), as introduced in Argiento et al. (2015). We have considered two cases, when mixing the variance rj 2 with respect to the NGG process or when the variance i] 2 is given a parametric density; in both cases, by linearity of the mean x t 6, the model (here called linear dependent NGG mixture) can be interpreted as a NGG process mixture model, and inference can be achieved via an al¬ gorithm similar to that in Section 5. We set £ = 10 -6 , cr € {0.001,0.125,0.25}, and k such that E (K n ) ~ 5 or 10. When the variance rj 2 is included in the location points of the s-NGG pro¬ cess, then Po is Ah (bo. E ( ,) x inv-gamma ( uq /2 , vorfi/2); on the other hand, when rj 2 is given a parametric density, then rj 2 ~in v-gamma (vq / 2 , //() //{/ 2 ). We fixed hyperparameters in agree¬ ment with the least squares estimate: bo = (“50,5,0,0), So = diag( 100,10,10,10), uq = 4, r )o = l- For all the experiments, we computed the posterior of the number of groups, the predictive densities at different values of the covariate vectors and the cluster estimate via posterior maximization of Binder's loss function (see Lau and Green, 2007). Moreover, we compared the different prior settings computing predictive goodness-of-fit tools, specifically log pseudo-marginal likelihood (LPML) and the sum of squared errors (SSE), as introduced in Section 6.2. The minimum value of SSE, among our experiments, was achieved when rj 2 is included in the location of the e-NGG process, a = 0.001 and k = 0.8 so that E (K n ) ~ 5. On the other hand, the optimal LPML was achieved when a = 0.125, k = 0.4, and E( K n ) ~ 5. Posterior of K n and cluster estimate under this last hyperparameter setting are in Figure 5 ((a) and ( b ), respectively); in particular the cluster estimate is displayed in the scatterplot of the Wt vs lbm. In spite of the vague prior, the posterior of K n is almost degenerate on 2, giving evidence to the existence of two linear relationships between lbm and Wt. Finally, Figure 6 displays predictive densities and 95% credibility bands for 3 athletes, a female (Wt=60, rcc=3.9, Ht=176 and lbm=53.71), and two males (Wt=67.1,113.7, rcc=5.34,5.17, Ht=178.6, 209.4 and lbm=62,97, respectively); the dashed lines are observed values of the re¬ sponse. Depending on the value of the covariate, the distribution shows one or two peaks: this reflects the dependence of the grouping of the data on the value of x. This figure high¬ lights the versatility of nonparametric priors in a linear regression setting with respect to the customary parametric priors: indeed, the model is able to capture in detail the behavior of the data, even when several clusters are present. e-NormCRM mixtures 27 g _ . - CD co - E CD -Or- #• * v : .*V*’ »• . LO — m • 1 3 Kn 80 Wt (a) (b) Figure 5: Posterior distributions of the number K n of groups (a) and cluster estimate (b) under the linear dependent e —NGG mixture. (a) (b) (c) Figure 6: Predictive distributions of lbm for three different athletes: Wt=60, rcc=3.9 / Flt=176 (a), Wt=67.1, rcc=5.34, Flt=178.6 (b), Wt=113.7, rcc=5.17, Ht=209.4 (c). The shaded area is the predictive 95% pointwise credible interval, while the dashed vertical line denotes the observed value of the response. e-NormCRM mixtures 28 8 Discussion We have proposed a new model for density and cluster estimation in the Bayesian non- parametric framework. In particular, a finite dimensional process, the e-NormCRM, has been defined, which converges in distribution to the corresponding normalized completely random measure, when e tends to 0. Here, the e-NormCRM is the mixing measure in a mixture model. In this paper we have fixed £ very small, but we could choose a prior for e and include this parameter into the Gibbs sampler scheme. Among the achievements of the work, we have generalized all the theoretical results obtained in the special case of NGG in Argiento et al. (2015), including the expression of the eppf for an e-NormCRM process, its convergence to the corresponding eppf of the nonparametric underlying process and the posterior characterization of P e . Moreover, we have provided a general Gibbs Sampler scheme to sample from the posterior of the mixture model. To show the performance of our algorithm and the flexibility of the model, we have illustrated two examples via normalized completely random measure mixtures: in the first application, we have introduced a new normalized completely random measure, named normalized Bessel random measure; we have studied its theoretical properties and used it as the mixing measure in a model to fit simulated and real datasets. The second example we have dealt with is a linear dependent e-NGG mixture, where the dependence lies on the support points of the mixing random probability, to fit a well known dataset. Current and future research is devoted on the use of our approximation on more complex dependence structures. e-NormCRM mixtures 29 APPENDIX: DETAILS ON FULL-CONDITIONALS FOR THE GIBBS SAMPLER Here, we provide some details about Step 3 of the Gibbs Sampler in Section 5. As far as Step 3a is concerned, the full-conditional C(e\u, 6) is obtained integrating out N £ (or equiv¬ alently N na ) from the law C(N e , u. 6), as follows: -boo C{e\u,0,Y)cx Y, £(N na ,£,u,0,Y) N na =0 +oo ( \ -A e ^E,u a (Nna + k ) yy f + °° Y e - jj—i —11/ KS e py s > ds N na =0 / ^ p+oo i= 1 ' I Ks ni e us p(s)ds e Ae ’ u Ae ^ £,u ^ + k ir(e) = f £ (u;ni, \i= 1 ' ,n k ) vr(e), where we used the identity Y'u a (JA na + k)/(N na \) = e As ’ u (A £jU + k ), as previously noted. Moreover, f £ (u ; rii,...,«/,) is defined in (11). This step depends explicitly on the expression of p(s). Step 3.b consists in sampling from C(P s \e,u,0) and has already been described in the proof of Proposition 4. However, for a complete outline of the algorithm, we list the full- conditionals resulting into Step 3b: (i) . £(N na \e, Y,u, 0) = Vi(A £U ) + -— - V o(A £M ); this is formula (14). ^■eu i rv J^-eu H" & (ii) . Non-allocated jumps: iid from £(J :j ) oc e~ uJj p( Jj)tr £j00 \(Jj), j = 1, •.. ,N na ; see the second factor of the last expression in (15). (iii) . Allocated jumps: iid from C{Ji*) oc Jj*e uJl i p(J;*)l( £ ,oo)(^*)/ * = 1, • • •, k; see the first factor of the last expression in (15). (iv). Non-allocated points of support: iid from P 0 ; see (19). (v). Allocated points of support: iid from £(r* ) oc {fl k(Xj-,Ti)}Po( T i), i. = 1,.... k; see (19). References Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian non- parametric problems. The Annals of Statistics 2,1152-1174. Argiento, R., I. Bianchini, and A. Guglielmi (2015). A blocked Gibbs sampler for NGG- mixture models via a priori truncation. Statist. Comp. Online First. e-NormCRM mixtures 30 Argiento, R., A. Guglielmi, C. Hsiao, F. Ruggeri, and C. Wang (2015). Modelling the associ¬ ation between clusters of SNPs and disease responses. In R. Mitra and R Mueller (Eds.), Nonparametric Bayesian Methods in Biostatistics and Bioinformatics. Springer. Argiento, R., A. Guglielmi, and A. Pievatolo (2010). Bayesian density estimation and model selection using nonparametric hierarchical mixtures. Computational Statistics and Data Analysis 54, 816-832. Barndorff-Nielsen, O. E. (2000). Probability densities and Levy densities. University of Aarhus. Centre for Mathematical Physics and Stochastics. Barrientos, A. F., A. Jara, F. A. Quintana, et al. (2012). On the support of MacEacherns dependent Dirichlet processes and extensions. Bayesian Analysis 7(2), 277-310. Barrios, E., A. Lijoi, L. E. Nieto-Barajas, and I. Priinster (2013). Modeling with normalized random measure mixture models. Statistical Science 28, 313-334. Basford, K., G. McLachlan, and M. York (1997). Modelling the distribution of stamp paper thickness via finite normal mixtures: The 1872 Hidalgo stamp issue of Mexico revisited. Journal of Applied Statistics 24(2), 169-180. Cook, R. D. and S. Weisberg (1994). An introduction to regression graphics. John Wiley & Sons. De Iorio, M., W. O. Johnson, P. Muller, and G. L. Rosner (2009). Bayesian nonparametric nonproportional hazards survival modeling. Biometrics 65(3), 762-771. Erdelyi, A., W. Magnus, F. Oberhettinger, F. G. Tricomi, and H. Bateman (1953). Higher transcendental functions, Volume 2. McGraw-Hill New York. Escobar, M. and M. West (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90, 577-588. Favaro, S. and Y. Teh (2013). MCMC for normalized random measure mixture models. Sta¬ tistical Science 28(3), 335-359. Feller, W. (1971). An introduction to probability theory and its Applications, vol. II (Second Edi¬ tion ed.). John Wiley, New York. Foti, N. and S. Williamson (2015). A survey of non-exchangeable priors for Bayesian non¬ parametric models. IEEE Transactions on pattern Analysis and Machine Intelligence 37, 359- 371. e-NormCRM mixtures 31 Gelman, A., J. Hwang, and A. Vehtari (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing 24(6), 997-1016. Gradshteyn, I. and L. Ryzhik (2007). Table of integrals, series, and products - Seventh Edition (Sixth ed.). San Diego (USA): Academic Press. Griffin, J. and S. G. Walker (2011). Posterior simulation of normalized random measure mixtures. Journal of Computational and Graphical Statistics 20, 241-259. Griffin, J. E. (2013). An adaptive truncation method for inference in Bayesian nonparametric models. arXiv preprint arXiv :1308.2045. Ishwaran, H. and L. James (2001). Gibbs sampling methods for stick-breaking priors. /. Amer. Statist. Assoc. 96, 161-173. Ishwaran, H. and L. F. James (2002). Approximate Dirichlet process computing in finite normal mixtures. Journal of computational and graphical statistics 11(3). Izenman, A. J. and C. J. Sommer (1988). Philatelic mixtures and multimodal densities. Journal of the American Statistical association 83(404), 941-953. James, L., A. Lijoi, and I. Priinster (2009). Posterior analysis for normalized random mea¬ sures with independent increments. Scand. J. Statist. 36, 76-97. Jara, A., T. E. Hanson, F. A. Quintana, P. Muller, and G. L. Rosner (2011). DPpackage: Bayesian semi-and nonparametric modeling in R. Journal of statistical softzvare 40(5), 1. Kingman, J. F. C. (1975). Random discrete distributions. Journal of the Royal Statistical Soci¬ ety 37(1), 1-22. Kingman, J. F. C. (1993). Poisson processes, Volume 3. Oxford university press. Lau, J. W. and P. J. Green (2007). Bayesian model based clustering procedures. Journal of Computational and Graphical Statistics 16, 526-558. Lijoi, A., R. H. Mena, and I. Priinster (2005). Hierarchical mixture modeling with normalized inverse-gaussian priors. Journal of the American Statistical Association 100(472), 1278-1291. Lo, A. J. (1984). On a class of Bayesian nonparametric estimates: I. density estimates. The Annals of Statistics 1. Lomeli, M., S. Favaro, and Y. W. Teh (2014). A marginal sampler for u-stable Poisson- Kingman mixture models. arXiv preprint arXiv:1407.4211. e-NormCRM mixtures 32 MacEachern, S. N. (1999). Dependent nonparametric processes. In ASA proceedings of the section on Bayesian statistical science, pp. 50-55. MacEachern, S. N. (2000). Dependent Dirichlet processes. Technical report. Department of Statistics, The Ohio State University. McAuliffe, J. D., D. M. Blei, and M. I. Jordan (2006). Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing 16(1), 5-14. Nieto-Barajas, L. E. (2013). Levy-driven processes in bayesian nonparametric inference. Bol. Soc. Mat. Mexicana (3) 19. Pitman, J. (1996). Some developments of the Blackwell-Macqueen urn scheme. In T. S. Ferguson, L. S. Shapley, and M. J. B. (Eds.), Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, Volume 30 of IMS Lecture Notes-Monograph Series, pp. 245-267. Hayward (USA): Institute of Mathematical Statistics. Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: a Festschrift for Terry Speed, Volume 40 of IMS Lecture Notes-Monograph Series, pp. 1-34. Hayward (USA): Insti¬ tute of Mathematical Statistics. Pitman, J. (2006). Combinatorial Stochastic Processes. LNM n. 1875. New York: Springer. Regazzini, E., A. Lijoi, and I. Priinster (2003). Distributional results for means of random measures with independent increments. The Annals of Statistics 31, 560-585. Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely appli¬ cable information criterion in singular learning theory. The Journal of Machine Learning Research 11, 3571-3594. Wilson, I. (1983). Add a new dimension to your philately. The American Philatelist 97, 342- 349.